Model-Free Prediction
Monte-Carlo (MC) Prediction
In Monte-Carlo (MC) prediction, we learn the value function directly from the experience of complete episodes. The value function is updated only at the end of each episode, so there is no bootstrapping.
MC prediction applies only to episodic tasks: every episode must terminate, because the return can only be computed once the agent reaches a terminal state.
The return $G_t$ is the sum of rewards from time step $t$ to the end of the episode, $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$. The rewards may be discounted ($\gamma < 1$), but discounting is not necessary in episodic tasks ($\gamma = 1$ is allowed).
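As a concrete illustration, the return of a complete episode can be computed from its reward sequence. This is a minimal sketch; the function name and the list-of-rewards representation are illustrative, not from the notes:

```python
def discounted_return(rewards, gamma=1.0):
    """Return G_t = R_{t+1} + gamma * R_{t+2} + ... for a complete episode,
    given the list of rewards received after time step t."""
    G = 0.0
    for r in reversed(rewards):  # accumulate backwards: G <- r + gamma * G
        G = r + gamma * G
    return G

discounted_return([1, 2, 3], gamma=1.0)  # undiscounted: 1 + 2 + 3 = 6.0
discounted_return([1, 2, 3], gamma=0.5)  # 1 + 0.5*2 + 0.25*3 = 2.75
```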
The value of a state $s$ under policy $\pi$ is the expected return, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$. Monte-Carlo policy evaluation estimates this expectation with the empirical mean of the observed returns.
Incremental Monte-Carlo Update
The value estimate is updated after every visit to a state, based on the error between the observed return and the current estimate. The mean return can be maintained incrementally:

$$N(S_t) \leftarrow N(S_t) + 1$$
$$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$$

where $N(S_t)$ is the visit count of $S_t$ and $G_t$ is the return following the visit. In non-stationary problems, a running mean with a constant step size can be used instead:

$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$$

where $\alpha$ controls how quickly old episodes are forgotten.
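The incremental update above can be sketched as a small every-visit MC prediction routine. This is a sketch under assumed representations (episodes as lists of `(state, reward)` pairs, tabular dictionaries for `V` and the visit counts `N`):

```python
def mc_prediction(episodes, gamma=1.0):
    """Incremental every-visit Monte-Carlo prediction.

    `episodes` is a list of trajectories, each a list of (state, reward)
    pairs, where `reward` is the reward received on leaving the state.
    Uses the running-mean update V(s) <- V(s) + (1/N(s)) * (G - V(s)).
    """
    V, N = {}, {}
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so the return G accumulates incrementally.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] = N.get(state, 0) + 1
            V[state] = V.get(state, 0.0) + (G - V.get(state, 0.0)) / N[state]
    return V

episodes = [[('A', 1.0), ('B', 2.0)], [('A', 0.0), ('B', 4.0)]]
mc_prediction(episodes)  # {'B': 3.0, 'A': 3.5}
```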
Temporal Difference (TD) Prediction
In Temporal Difference (TD) prediction, we learn the value function by bootstrapping from the current estimate of the value function. The update is done at each time step, so there is bootstrapping. The episode does not need to be complete, meaning the agent can learn from incomplete sequences.
TD prediction updates a guess towards a guess: the current estimate $V(S_t)$ is moved towards the estimated target $R_{t+1} + \gamma V(S_{t+1})$.
Incremental Every-Visit MC: update $V(S_t)$ towards the actual return $G_t$:
$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$$
TD(0): update $V(S_t)$ towards the estimated return $R_{t+1} + \gamma V(S_{t+1})$, the TD target:
$$V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)$$
The term $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.
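A single TD(0) update can be sketched as follows (the helper name and the transition representation are assumptions for illustration):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """TD(0): V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t)).

    Bootstraps from the current estimate V(S_{t+1}); the value of a
    terminal state is zero by definition.
    """
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    td_error = r + gamma * v_next - V.get(s, 0.0)  # the TD error delta_t
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V

V = {}
td0_update(V, 'A', 1.0, 'B', alpha=0.5)                  # V['A'] -> 0.5
td0_update(V, 'B', 2.0, None, alpha=0.5, terminal=True)  # V['B'] -> 1.0
```

Unlike the MC routine above, this update can be applied immediately after every transition, without waiting for the episode to end.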
Here, we can see the difference between Monte-Carlo and Temporal Difference prediction. In Monte-Carlo prediction, the value function is updated only at the end of the episode, whereas in Temporal Difference prediction it is updated at each time step. The MC target (the return) is unbiased but has high variance; the TD target is biased, because it relies on the current estimate $V$, but has much lower variance.
The return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$. Hence, the true TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is also an unbiased estimate of $v_\pi(S_t)$. However, the TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate, because it uses the current estimate $V(S_{t+1})$ in place of the true value $v_\pi(S_{t+1})$. In exchange, the TD target has much lower variance than the return: it depends on only one random action, transition, and reward, whereas the return depends on many.
Unified View
Monte-Carlo Backup: $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$ — one sampled, complete trajectory.
Temporal Difference Backup: $V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)$ — one sampled, one-step lookahead.
Dynamic Programming Backup: $V(S_t) \leftarrow \mathbb{E}_\pi\left[R_{t+1} + \gamma V(S_{t+1})\right]$ — a full-width, one-step lookahead over all successors.
Bootstrapping and Sampling
Bootstrapping: Update the value function based on the current estimate of the value function.
- MC does not bootstrap
- TD bootstraps
- DP bootstraps
Sampling: Update the value function based on sampled experience (actual rewards and transitions) rather than a full expectation over all successors.
- MC samples
- TD samples
- DP does not sample
TD(λ) Prediction with Forward View
The $\lambda$-return $G_t^\lambda$ combines all $n$-step returns $G_t^{(n)}$ with the weights $(1-\lambda)\lambda^{n-1}$:
$$G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$
The forward-view update moves $V(S_t)$ towards the $\lambda$-return:
$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^\lambda - V(S_t)\right)$$
Like MC, the forward view of TD(λ) looks forward in time to compute $G_t^\lambda$, so it can only be applied to complete episodes.
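The λ-return for the start state of a complete episode can be sketched as below. The episode representation (a reward list and a list of per-step value estimates) and the function names are assumptions for illustration; in an episode of length $T$, the weight of the final (Monte-Carlo) return is the remaining mass $\lambda^{T-1}$:

```python
def n_step_return(rewards, values, n, gamma):
    """n-step return from t=0: the first n discounted rewards plus
    gamma^n * V(S_n), or the full MC return if n reaches the end."""
    T = len(rewards)
    if n >= T:
        return sum(gamma**k * r for k, r in enumerate(rewards))
    g = sum(gamma**k * rewards[k] for k in range(n))
    return g + gamma**n * values[n]

def lambda_return(rewards, values, gamma=1.0, lam=0.9):
    """Forward-view lambda-return G_0^lambda for a complete episode:
    (1 - lam) * sum_n lam^(n-1) * G^(n), with the tail weight lam^(T-1)
    placed on the final Monte-Carlo return."""
    T = len(rewards)
    g = (1 - lam) * sum(lam**(n - 1) * n_step_return(rewards, values, n, gamma)
                        for n in range(1, T))
    return g + lam**(T - 1) * n_step_return(rewards, values, T, gamma)
```

Setting `lam=0` recovers the one-step TD target, and `lam=1` recovers the Monte-Carlo return, matching the unified view above.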
TD(λ) Prediction with Backward View
The forward view provides the theoretical foundation for TD(λ): it defines the target the algorithm moves towards.
The backward view provides the mechanism to compute the same updates online, at every time step, from incomplete sequences, using eligibility traces.
Eligibility Traces
Eligibility traces are a key concept in reinforcement learning that bridge the gap between Monte-Carlo methods and Temporal Difference (TD) learning, combining the strengths of both approaches. They answer a credit-assignment question: when a TD error occurs, which previously visited states should be updated, and by how much? The same mechanism also appears in policy-based algorithms such as policy-gradient methods.
- Frequency Heuristic: The more frequently a state is visited, the more important it is.
- Recency Heuristic: The more recently a state is visited, the more important it is.
The eligibility trace combines both heuristics:
$$E_0(s) = 0, \qquad E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$$
In the backward view, the TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is broadcast to every state $s$ in proportion to its trace:
$$V(s) \leftarrow V(s) + \alpha\,\delta_t\,E_t(s)$$
If the updates are applied after each time step, they are called online updates. If the updates are accumulated and applied at the end of the episode, they are called offline updates.
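The backward-view mechanics — decay all traces, bump the visited state's trace, then broadcast the TD error — can be sketched for the tabular online case (the transition format and function name are illustrative assumptions):

```python
def td_lambda_backward(episode, alpha=0.1, gamma=1.0, lam=0.9, V=None):
    """Online backward-view TD(lambda) over one episode.

    `episode` is a list of (state, reward, next_state) transitions, with
    next_state = None at the terminal step.  Each step, every trace decays
    by gamma * lam (recency) and the visited state's trace is incremented
    (frequency); the TD error is then broadcast in proportion to the traces.
    """
    V = dict(V or {})
    E = {}  # eligibility traces E_t(s)
    for s, r, s_next in episode:
        for state in E:
            E[state] *= gamma * lam          # recency: decay all traces
        E[s] = E.get(s, 0.0) + 1.0           # frequency: bump visited state
        v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
        delta = r + gamma * v_next - V.get(s, 0.0)   # TD error delta_t
        for state, trace in E.items():       # online update at every step
            V[state] = V.get(state, 0.0) + alpha * delta * trace
    return V
```

With `lam=0` every trace but the current state's is zero, so this reduces to TD(0); with larger `lam`, earlier states in the episode also receive a share of each TD error.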
#MMI706 - Reinforcement Learning at METU